Now that we have a sense of what our data is like, we can get started with the data analysis.
OPTIONAL (but recommended): Running the preprocessing (to see the result)
The next major function of the recipes package is prep().
This function updates the recipe object based on the training data. It estimates the parameters required for preprocessing (the quantities and statistics required by the steps for each variable) and updates the model terms, since some of the predictors may be removed; this makes the recipe ready to use on other datasets. It does not necessarily execute the preprocessing itself; however, we will specify an argument so that it does, which allows us to take a look at the preprocessed data.
There are some important arguments to know about:

1) training - you must supply a training data set to estimate parameters for the preprocessing operations (recipe steps); this may already be included in your recipe, as is the case for us
2) fresh - if TRUE, will retrain and re-estimate parameters for any previous steps that were already prepped when you add more steps to the recipe
3) verbose - if TRUE, shows the progress as the steps are evaluated and the size of the preprocessed training set
4) retain - if TRUE, the preprocessed training set will be saved within the recipe (as template). This is useful if you are likely to add more steps and don't want to rerun prep() on the previous steps, though it can make the recipe object large. It is necessary if you want to actually look at the preprocessed data.
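Putting these arguments together, a prep() call might look like the following sketch (simple_rec is an assumed name for the recipe object created earlier):

```r
# sketch: prep the recipe, printing progress (verbose = TRUE) and keeping
# the preprocessed training set inside the recipe object (retain = TRUE)
prepped_rec <- prep(simple_rec, verbose = TRUE, retain = TRUE)

# see what the prepped recipe object contains
names(prepped_rec)
```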
oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.26 Mb in memory.
[1] "var_info" "term_info" "steps" "template"
[5] "levels" "retained" "tr_info" "orig_lvls"
[9] "last_term_info"
There are also lots of useful things to check out in the output of prep(). You can see:

1) the steps that were run
2) the variable info (var_info)
3) the model term info (term_info)
4) the new levels of the variables
5) the original levels of the variables (orig_lvls)
6) info about the training data set size and completeness (tr_info)
Note: You may see the prep.recipe() function in material that you read about the recipes package. This refers to the prep() function of the recipes package.
Rows: 584
Columns: 36
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000 <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013 <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
For easy comparison's sake, here is our original data:
Rows: 876
Columns: 50
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003…
$ value <dbl> 9.597647, 10.800000, 11.212174, 11.659091…
$ fips <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9683…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama…
$ county <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "…
$ city <chr> "Fairhope", "Ashland", "Muscle Shoals", "…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9…
$ zcta <fct> 36532, 36251, 35660, 35962, 35901, 36303,…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 203836235…
$ zcta_pop <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 90…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 5.78…
$ imp_a1000 <dbl> 1.4096021, 0.8531574, 11.1448962, 3.86764…
$ imp_a5000 <dbl> 3.3360118, 0.9851479, 15.1786154, 1.23114…
$ imp_a10000 <dbl> 1.9879187, 0.5208189, 9.7253870, 1.031646…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 0.973044…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 20126…
$ county_pop <dbl> 182265, 13932, 54428, 71109, 104430, 1015…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9…
$ log_pri_length_10000 <dbl> 9.210340, 9.210340, 9.274303, 10.409411, …
$ log_pri_length_15000 <dbl> 9.630228, 9.615805, 9.658899, 11.173626, …
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 10.21420…
$ log_prisec_length_10000 <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 1…
$ log_prisec_length_15000 <dbl> 12.205723, 12.042963, 13.282656, 12.35366…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 1…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.350…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 35.330814…
$ popdens_zcta <dbl> 145.716431, 13.639555, 540.887040, 40.718…
$ nohs <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7,…
$ somehs <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, …
$ hs <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5,…
$ associate <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17…
$ grad <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, …
$ pov <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4,…
$ urc2013 <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ urc2006 <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ aod <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 4…
Notice how we only have 36 variables now instead of 50! Two of these are our ID variables (fips and the actual monitor ID (id)) and one is our outcome (value). Thus we only have 33 predictors now. We can also see that we no longer have any categorical variables. Variables like state are gone, and only state_California remains, as it was the only state dummy variable without near-zero variance: California had the largest number of monitors of any state. Similarly, more monitors were listed as "Not in a city" than in any single city.
Note: Recall that you must specify the retain = TRUE argument of the prep() function to use juice().
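While juice() pulls out the retained preprocessed training data, applying the same trained steps to the test data is done with bake(). A sketch, assuming prepped_rec is the prepped recipe and test_pm is our test set:

```r
# apply the preprocessing steps (as estimated on the training data) to the test data
baked_test_pm <- bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
```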
Rows: 292
Columns: 36
$ id <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500 <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000 <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000 <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000 <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500 <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000 <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000 <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000 <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013 <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
$ city_Not.in.a.city <dbl> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA…
Notice that our city_Not.in.a.city variable seems to be full of NA values. Why might that be?
Ah! Perhaps it is because some of our levels were not previously seen in the training set!
Let’s investigate using the set operations of the dplyr package to see which cities differ between the test and training sets.
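A sketch of this comparison, assuming train_pm and test_pm are our split data sets (traincities and testcities are names introduced here for illustration):

```r
library(dplyr)

traincities <- train_pm %>% distinct(city)
testcities  <- test_pm %>% distinct(city)

# cities in the training set that are not in the test set
dim(setdiff(traincities, testcities))

# cities in the test set that are not in the training set
dim(setdiff(testcities, traincities))
```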
[1] 376 1
[1] 51 1
Indeed, there are lots of different cities in our test data that are not in our training data!
Thus we need to update our original recipe to include an important step function called step_novel(). This helps in cases like this, where there are factor levels in our testing set that were not seen in our training set. It is a good idea to include this step in most recipes where you have categorical variables with many distinct values. It needs to come before we create the dummy variables. However, because we are also creating a dummy variable from this variable, we would still run into a problem.
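For reference, step_novel() would be placed just before the dummy step, something like this sketch (not the recipe we end up using):

```r
# sketch: step_novel() assigns previously unseen factor levels to a "new"
# category; it must come before step_dummy() so that category gets a column
recipe(train_pm) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(value, new_role = "outcome") %>%
  step_novel(state, county, city, zcta) %>%
  step_dummy(state, county, city, zcta, one_hot = TRUE)
```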
Let’s modify the city variable to take the values "In a city" or "Not in a city" using the if_else() function of dplyr. Alternatively, you could create a custom step function to do this and add it to your recipe, but that is beyond the scope of this case study.
We need to create a new recipe to move forward, as the levels of our variables are established when the recipe is created. We would also potentially have this issue for state and county, so let’s do something similar for state. The county variable appears to get dropped due to either correlation or near-zero variance; it is likely near-zero variance, because county is the most granular of these geographic categorical variables and is therefore likely sparse.
train_pm[["city"]] <- if_else(
train_pm[["city"]] == "Not in a city",
true = "Not in a city", false = "In a city")
test_pm[["city"]] <- if_else(
test_pm[["city"]] == "Not in a city",
true = "Not in a city", false = "In a city")
train_pm[["state"]] <- if_else(
train_pm[["state"]] == "California",
true = "California", false = "Not California")
test_pm[["state"]] <- if_else(
test_pm[["state"]] == "California",
true = "California", false = "Not California")
novel_rec <- recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(value, new_role = "outcome") %>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    # create numeric dummy variables to encode the categorical variables
    step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
    # identify and remove all correlated predictors (now that they are numeric)
    step_corr(all_numeric()) %>%
    # identify and remove variables with near-zero variance
    step_nzv(all_numeric())
Now let’s re-prep the recipe with our training data and try baking our test data:
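A sketch of that retraining, assuming novel_rec is the recipe defined above:

```r
# prep the new recipe on the training data
prepped_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)

# extract the preprocessed training data retained in the recipe
preproc_train <- juice(prepped_rec)
glimpse(preproc_train)
```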
oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.26 Mb in memory.
Rows: 584
Columns: 37
$ id <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000 <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.70026, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013 <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_Not.California <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
Notice that we gained log_prisec_length_25000 back with this recipe, using the data with our changes to state and city.
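Baking the test data works the same way (a sketch, assuming prepped_rec is the result of prepping novel_rec):

```r
# apply the retrained preprocessing steps to the test data
baked_test_pm <- bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
```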
Rows: 292
Columns: 37
$ id <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500 <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000 <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000 <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000 <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500 <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000 <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000 <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000 <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_prisec_length_25000 <dbl> 13.55979, 14.08915, 14.27363, 13.87170, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013 <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_Not.California <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Great, now we no longer have NA values! :)
Note: if you use the skip option for some of the preprocessing steps, be careful: juice() will show all of the results, ignoring skip = TRUE, while bake() will not necessarily conduct these steps on the new data.
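A small self-contained illustration of this behavior using the built-in mtcars data (not our air pollution data):

```r
library(recipes)

# a recipe with one skipped step
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_filter(mpg > 20, skip = TRUE) %>%  # applied at prep/juice, skipped by bake
  prep(retain = TRUE)

nrow(juice(rec))                    # only the filtered (mpg > 20) rows
nrow(bake(rec, new_data = mtcars))  # all 32 rows: the filter step was skipped
```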
Looking at model fit with broom
The broom package allows for an easy/tidy way to look at the fitted model:

1) tidy() grabs the coefficients from the model
2) glance() summarizes the model fit and gives us an idea about how well the model might perform
3) augment() gives an observation-level summary of the data and the fit
These broom functions currently only work with parsnip objects, not raw workflow objects. To use the tidy() function with a workflow, we need to first use the pull_workflow_fit() function.
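A sketch, assuming PM_wflow_fit is the name of our fitted workflow object:

```r
library(broom)
library(workflows)

# extract the underlying parsnip model fit from the workflow
wflowfit <- pull_workflow_fit(PM_wflow_fit)

# coefficient table
tidy(wflowfit)

# model-level fit statistics (glance applied to the underlying lm object)
glance(wflowfit$fit)
```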
# A tibble: 34 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 state_Not.California -3.44 0.436 -7.91 1.45e-14
2 CMAQ 0.285 0.0430 6.61 8.90e-11
3 aod 0.0256 0.00575 4.45 1.06e- 5
4 lon 0.0258 0.00998 2.58 1.00e- 2
5 county_pop -0.000000216 0.0000000934 -2.31 2.14e- 2
6 urc2013 0.215 0.101 2.13 3.35e- 2
7 grad -2.28 1.20 -1.90 5.78e- 2
8 bachelor -2.28 1.20 -1.90 5.84e- 2
9 somecollege -2.27 1.20 -1.89 5.87e- 2
10 associate -2.27 1.20 -1.89 5.90e- 2
# … with 24 more rows
# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.433 0.399 2.08 12.7 1.02e-48 34 -1240. 2550. 2703.
# … with 2 more variables: deviance <dbl>, df.residual <int>
# A tibble: 584 x 42
value lat lon CMAQ zcta_area zcta_pop imp_a500 imp_a15000 county_area
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9.60 30.5 -87.9 8.10 190980522 27829 0.0173 1.44 4117521611
2 10.8 33.3 -85.8 9.77 374132430 5103 1.97 0.336 1564252280
3 11.2 34.8 -87.7 9.40 16716984 9042 19.2 5.25 1534877333
4 12.4 34.0 -86.0 9.24 154069359 20045 16.5 5.16 1385618994
5 10.5 31.2 -85.4 9.12 162685124 30217 19.1 4.74 1501737720
6 15.6 33.6 -86.8 10.2 26929603 9010 41.8 17.5 2878192209
7 12.4 33.3 -87.0 10.2 166239542 16140 1.70 4.30 2878192209
8 11.1 33.5 -87.3 8.16 385566685 3699 0 0.162 3423328940
9 14.6 33.5 -86.9 10.2 10636977 11458 43.6 15.6 2878192209
10 12.0 33.7 -86.7 9.30 150661846 21725 1.48 4.25 2878192209
# … with 574 more rows, and 33 more variables: county_pop <dbl>,
# log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
# log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
# log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
# log_prisec_length_10000 <dbl>, log_prisec_length_25000 <dbl>,
# log_nei_2008_pm10_sum_15000 <dbl>, log_nei_2008_pm10_sum_25000 <dbl>,
# popdens_county <dbl>, popdens_zcta <dbl>, nohs <dbl>, somehs <dbl>,
# hs <dbl>, somecollege <dbl>, associate <dbl>, bachelor <dbl>, grad <dbl>,
# pov <dbl>, hs_orless <dbl>, urc2013 <dbl>, aod <dbl>,
# state_Not.California <dbl>, city_Not.in.a.city <dbl>, .fitted <dbl>,
# .se.fit <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
# .std.resid <dbl>
[1] TRUE
OK, so we have fit our model on our training data, which means we have created a model to predict values of air pollution based on the predictors that we have included. Yay!
Let’s take a look at how well our model fit our training data:
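One way to get the fitted values is from the underlying lm object; a sketch, assuming wflowfit is the parsnip fit extracted from our workflow with pull_workflow_fit():

```r
# fitted values of the linear model on the training data
wf_fitted_values <- fitted(wflowfit$fit)
wf_fitted_values
```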
1 2 3 4 5 6 7 8
9.461664 10.429189 11.795351 11.139746 10.863402 11.091857 10.073411 8.189021
9 10 11 12 13 14 15 16
11.517888 10.618367 9.557499 10.218323 9.865608 11.011802 12.139040 10.202025
17 18 19 20 21 22 23 24
10.517763 9.548571 6.837646 7.472453 7.018230 10.126795 10.133236 9.582305
25 26 27 28 29 30 31 32
9.702583 6.862089 7.005665 8.477788 8.892971 9.244236 11.417965 10.464960
33 34 35 36 37 38 39 40
9.652665 9.943594 11.435373 10.090206 10.304301 11.188076 11.193689 10.767763
41 42 43 44 45 46 47 48
10.722817 9.949272 10.521367 9.976252 11.955651 11.674338 13.902418 13.582229
49 50 51 52 53 54 55 56
14.713622 11.219505 11.320158 15.169192 8.682176 13.357477 11.166948 14.162085
57 58 59 60 61 62 63 64
13.128209 10.867659 12.557864 11.404720 14.609938 10.003168 11.917995 9.886651
65 66 67 68 69 70 71 72
11.161819 12.326977 12.722475 11.439546 11.380029 14.615337 10.945608 12.754387
73 74 75 76 77 78 79 80
11.464256 13.050833 11.162822 14.092701 10.025276 12.181293 12.573405 12.850499
81 82 83 84 85 86 87 88
12.523434 13.398889 8.867379 12.250365 15.385913 10.726820 12.320977 12.282267
89 90 91 92 93 94 95 96
10.903510 13.504498 10.678776 11.367829 14.267153 11.959658 14.860756 12.245437
97 98 99 100 101 102 103 104
11.009600 12.303933 10.376440 9.254959 8.458964 10.797185 9.615059 8.319766
105 106 107 108 109 110 111 112
6.584026 8.480982 7.889599 8.520672 9.507922 9.460726 12.291791 10.315367
113 114 115 116 117 118 119 120
9.899759 10.270368 9.531778 8.476645 10.506159 10.457589 11.548915 12.211470
121 122 123 124 125 126 127 128
12.754005 11.760960 11.527420 13.358326 12.970527 11.450112 9.662891 8.860986
129 130 131 132 133 134 135 136
9.630399 9.107099 9.336214 10.377070 10.913731 10.012413 11.655663 9.197286
137 138 139 140 141 142 143 144
9.895577 8.992575 11.062182 10.928292 11.173204 9.290713 9.462250 9.815623
145 146 147 148 149 150 151 152
10.567132 10.041853 12.438193 12.837043 11.155833 11.068409 11.568628 10.646408
153 154 155 156 157 158 159 160
11.120305 10.674787 11.342914 11.600863 9.872202 12.933636 12.330424 10.674265
161 162 163 164 165 166 167 168
10.692636 8.591527 8.483559 8.277465 9.219110 7.999004 7.325638 10.668789
169 170 171 172 173 174 175 176
10.442171 13.278721 12.987068 12.609307 12.139812 13.311952 9.560588 11.231509
177 178 179 180 181 182 183 184
9.614572 11.388626 11.027188 10.701644 10.408321 11.281993 11.952935 12.314095
185 186 187 188 189 190 191 192
11.343060 12.301626 11.670473 11.538465 11.450721 12.583606 9.908812 11.957886
193 194 195 196 197 198 199 200
12.476480 11.356784 11.408551 11.386678 11.371265 11.869010 11.386166 10.716040
201 202 203 204 205 206 207 208
13.547182 13.430563 14.488555 13.382721 11.721882 13.138956 11.617227 12.919960
209 210 211 212 213 214 215 216
11.610622 11.234236 12.441517 11.565354 11.775386 11.870380 11.612343 10.513345
217 218 219 220 221 222 223 224
11.332066 11.356102 8.622875 9.342198 10.532866 11.075768 10.814015 9.148910
225 226 227 228 229 230 231 232
10.259892 8.851762 10.150785 11.702025 11.247041 9.448320 8.715848 10.465157
233 234 235 236 237 238 239 240
9.063738 8.512195 9.359375 8.694810 12.250888 10.950829 11.172992 11.113227
241 242 243 244 245 246 247 248
12.539577 11.130235 11.520853 13.270299 12.445247 11.641586 11.084325 11.681937
249 250 251 252 253 254 255 256
12.903454 11.675062 11.868743 11.887980 12.142425 11.125280 10.206552 13.083514
257 258 259 260 261 262 263 264
12.553520 13.232212 11.590232 11.070216 11.072157 11.721994 10.595329 14.228343
265 266 267 268 269 270 271 272
13.651194 10.111261 12.956021 13.017707 10.712920 13.277355 11.914351 11.578296
273 274 275 276 277 278 279 280
12.601671 13.808564 13.137512 11.706520 11.860984 9.304048 10.849169 10.160520
281 282 283 284 285 286 287 288
10.957044 10.853018 11.595010 11.208375 11.222850 10.646720 10.875621 10.875865
289 290 291 292 293 294 295 296
12.139867 9.344793 7.829627 7.419011 10.276070 8.595271 11.984527 10.025066
297 298 299 300 301 302 303 304
11.658014 10.844432 13.668283 13.877468 11.842378 12.609642 12.315624 7.550419
305 306 307 308 309 310 311 312
10.543817 10.140751 10.640678 9.986984 10.672587 12.808831 8.608027 8.762374
313 314 315 316 317 318 319 320
9.800301 8.990358 10.913175 10.986842 9.848777 10.184492 10.730669 8.925857
321 322 323 324 325 326 327 328
9.810568 11.051639 12.301691 13.244583 12.143078 7.420562 7.181852 8.115063
329 330 331 332 333 334 335 336
7.557550 7.598764 7.963658 6.033631 8.174960 7.009403 6.547324 6.721532
337 338 339 340 341 342 343 344
10.495751 10.127753 8.447220 8.955561 9.111061 6.229223 7.721007 8.702168
345 346 347 348 349 350 351 352
9.796073 10.735891 10.674564 9.993831 9.387035 11.404517 12.718761 12.490109
353 354 355 356 357 358 359 360
12.490096 12.868844 14.159822 11.862943 12.490662 12.814318 13.292919 11.342429
361 362 363 364 365 366 367 368
11.790078 13.543314 12.475061 11.439849 9.501710 9.938767 8.662747 9.361561
369 370 371 372 373 374 375 376
7.333849 7.619211 8.424590 8.182639 8.244907 11.421981 12.598988 12.766756
377 378 379 380 381 382 383 384
10.004449 11.846089 11.484665 8.502844 9.985085 14.093239 11.022401 10.343070
385 386 387 388 389 390 391 392
10.870999 11.580658 8.164514 9.816118 13.810603 10.765304 10.679352 8.943409
393 394 395 396 397 398 399 400
11.394455 10.988983 11.911609 11.169336 10.926602 11.106484 10.975384 10.570735
401 402 403 404 405 406 407 408
10.800890 11.216585 11.003865 12.058352 8.979768 10.417642 10.840787 8.564483
409 410 411 412 413 414 415 416
10.468748 7.722598 9.024726 8.627277 8.712337 12.306801 11.259834 12.135047
417 418 419 420 421 422 423 424
12.909887 13.900961 14.613681 14.659878 14.516719 11.834130 12.835139 11.524950
425 426 427 428 429 430 431 432
11.205715 12.690535 12.366608 12.257143 11.554346 11.571765 10.268979 13.831592
433 434 435 436 437 438 439 440
12.488541 12.274712 11.792411 12.159000 11.286455 12.398744 12.720250 11.839790
441 442 443 444 445 446 447 448
12.042530 10.616712 11.592961 8.680015 10.002905 10.098331 9.303447 8.025186
449 450 451 452 453 454 455 456
8.160942 7.701500 7.573096 6.440509 8.413890 7.573817 8.654901 6.527707
457 458 459 460 461 462 463 464
9.316892 9.239795 7.577079 7.234717 7.744910 10.396131 11.349235 10.531142
465 466 467 468 469 470 471 472
11.286622 11.324207 11.904075 11.984744 11.254344 10.087881 12.491983 12.727448
473 474 475 476 477 478 479 480
11.023725 9.843285 12.532214 10.909423 11.556938 11.246725 13.678810 10.995575
481 482 483 484 485 486 487 488
13.406587 12.947436 11.536939 10.519024 10.614975 10.691272 12.606993 11.966835
489 490 491 492 493 494 495 496
11.220192 9.923407 11.012463 9.424555 10.077508 10.808538 10.687251 11.438914
497 498 499 500 501 502 503 504
12.408781 10.434905 8.167398 7.968639 7.492743 7.971856 7.794390 8.631370
505 506 507 508 509 510 511 512
7.279254 7.833077 10.330354 10.925965 10.909906 8.503572 12.117403 10.667929
513 514 515 516 517 518 519 520
10.129061 10.550327 9.918316 11.522068 11.296737 11.553654 10.476981 11.941762
521 522 523 524 525 526 527 528
10.578826 11.080807 9.211676 10.323909 9.621231 8.335038 8.188955 9.082492
529 530 531 532 533 534 535 536
8.075345 9.420944 8.314572 9.247623 8.538591 9.859266 7.647124 9.902155
537 538 539 540 541 542 543 544
8.546524 9.920558 11.604894 11.918312 12.416602 10.819737 11.466657 11.179962
545 546 547 548 549 550 551 552
11.665760 9.817793 14.172912 10.192052 11.355575 13.772991 8.847900 7.970775
553 554 555 556 557 558 559 560
7.460673 8.349559 8.135941 11.447727 11.618942 11.936942 10.918982 11.041702
561 562 563 564 565 566 567 568
9.949377 11.499303 10.904615 11.689807 11.747444 11.380585 11.009586 10.850464
569 570 571 572 573 574 575 576
11.602371 11.551376 12.485626 11.270907 10.170153 10.619552 9.784335 8.033726
577 578 579 580 581 582 583 584
10.810373 8.993775 5.743434 8.959191 8.089329 6.061699 8.339742 6.931294
# A tibble: 584 x 8
value .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9.60 9.46 0.375 0.136 0.0324 2.09 0.00000433 0.0663
2 10.8 10.4 0.383 0.371 0.0338 2.09 0.0000338 0.181
3 11.2 11.8 0.404 -0.583 0.0376 2.09 0.0000936 -0.285
4 12.4 11.1 0.388 1.24 0.0346 2.08 0.000384 0.604
5 10.5 10.9 0.426 -0.355 0.0418 2.09 0.0000388 -0.174
6 15.6 11.1 0.379 4.50 0.0332 2.08 0.00486 2.20
7 12.4 10.1 0.459 2.33 0.0485 2.08 0.00196 1.14
8 11.1 8.19 0.777 2.91 0.139 2.08 0.0108 1.51
9 14.6 11.5 0.378 3.03 0.0329 2.08 0.00219 1.48
10 12.0 10.6 0.452 1.33 0.0470 2.08 0.000623 0.656
# … with 574 more rows
1 2 3 4 5 6 7 8
9.461664 10.429189 11.795351 11.139746 10.863402 11.091857 10.073411 8.189021
9 10 11 12 13 14 15 16
11.517888 10.618367 9.557499 10.218323 9.865608 11.011802 12.139040 10.202025
[ fitted values for the remaining 568 monitors omitted ]
# A tibble: 584 x 8
value .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9.60 9.46 0.375 0.136 0.0324 2.09 0.00000433 0.0663
2 10.8 10.4 0.383 0.371 0.0338 2.09 0.0000338 0.181
3 11.2 11.8 0.404 -0.583 0.0376 2.09 0.0000936 -0.285
4 12.4 11.1 0.388 1.24 0.0346 2.08 0.000384 0.604
5 10.5 10.9 0.426 -0.355 0.0418 2.09 0.0000388 -0.174
6 15.6 11.1 0.379 4.50 0.0332 2.08 0.00486 2.20
7 12.4 10.1 0.459 2.33 0.0485 2.08 0.00196 1.14
8 11.1 8.19 0.777 2.91 0.139 2.08 0.0108 1.51
9 14.6 11.5 0.378 3.03 0.0329 2.08 0.00219 1.48
10 12.0 10.6 0.452 1.33 0.0470 2.08 0.000623 0.656
# … with 574 more rows
[1] TRUE

OK, so the range of our fitted values appears to be narrower than the range of the observed values. We could probably do a bit better.
At this point we could take a look at the accuracy of our model's performance, but we would have only one reference point: our testing dataset. And we haven't done any tuning of our model or any cross-validation... so generally speaking you should not do this.
However, if you were short on time, you could continue like this.
Using the workflows approach, you would need functions from the tune package, such as the tune::tune_*() functions, tune::fit_resamples(), or tune::last_fit(), to get an assessment of fit.
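To make the resampling idea concrete, here is a minimal, self-contained sketch of the fit_resamples() pattern. It uses the built-in mtcars data and a simple linear model rather than our PM2.5 workflow, so all object names here are illustrative, not the ones from this case study.

```r
library(tidymodels)

set.seed(123)

# Illustrative example with mtcars, not our PM2.5 data:
# create 5 cross-validation folds
folds <- vfold_cv(mtcars, v = 5)

# a simple workflow: linear regression of mpg on weight and horsepower
wflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(mpg ~ wt + hp)

# fit the workflow on each resample and collect performance metrics
res <- fit_resamples(wflow, resamples = folds)
collect_metrics(res)
```

This gives an estimate of performance averaged over the resamples, rather than relying on a single train/test comparison.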
We could stop here and use the yardstick package to evaluate performance. First, we would want to use the model to predict values for the monitors in the testing data.
Using parsnip, we need to predict on the baked testing data.
# A tibble: 292 x 1
.pred
<dbl>
1 10.4
2 10.5
3 10.3
4 10.4
5 10.6
6 12.9
7 9.53
8 9.47
9 8.82
10 7.67
# … with 282 more rows
# A tibble: 292 x 3
.pred value fips
<dbl> <dbl> <fct>
1 10.4 11.7 1049
2 10.5 13.1 1073
3 10.3 12.2 1073
4 10.4 12.2 1089
5 10.6 11.4 1103
6 12.9 12.2 1121
7 9.53 10.9 4013
8 9.47 10.6 4021
9 8.82 14.1 4023
10 7.67 5.83 4025
# … with 282 more rows
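The pattern that produces output like the tibbles above can be sketched as follows. This is a self-contained example using mtcars (so the recipe, data, and object names are illustrative, not the ones from this case study): prep and bake the recipe, fit the parsnip model on the baked training data, then predict on the baked testing data.

```r
library(tidymodels)

set.seed(123)

# Illustrative example with mtcars, not our PM2.5 data
split <- initial_split(mtcars)
train <- training(split)
test  <- testing(split)

# a small recipe: normalize all numeric predictors
rec <- recipe(mpg ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

prepped     <- prep(rec, training = train)
baked_train <- bake(prepped, new_data = train)
baked_test  <- bake(prepped, new_data = test)

# fit a parsnip linear model on the baked training data
fit_lm <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = baked_train)

# predict() returns a tibble with a single .pred column,
# one row per observation in the baked testing data
preds <- predict(fit_lm, new_data = baked_test)
```

Binding `preds` to the observed outcome with bind_cols() gives a tibble like the one shown above, ready for evaluation.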
Let’s take a look at how well it did:
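As a sketch of the yardstick pattern, using a few of the predicted/observed pairs from the testing output shown above: each metric function takes a data frame plus truth and estimate columns.

```r
library(tidymodels)

# a few of the predicted/observed pairs from the testing output above
results <- tibble(
  .pred = c(10.4, 10.5, 10.3, 10.4, 10.6),
  value = c(11.7, 13.1, 12.2, 12.2, 11.4)
)

# each yardstick metric takes the data, a truth column, and an estimate column
rmse(results, truth = value, estimate = .pred)
rsq(results,  truth = value, estimate = .pred)

# metric_set() bundles several metrics into a single function
multi_metric <- metric_set(rmse, rsq, mae)
multi_metric(results, truth = value, estimate = .pred)
```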
If we were done optimizing our parameters, we could then use the last_fit() function from the tune package on our workflow.
However, we would really want to tune our parameters and use cross-validation before doing this.
Note that yardstick can't talk directly to workflows; you need tune for that.